B&B with Local Heaps by nguidotti · Pull Request #1149 · NVIDIA/cuopt

nguidotti · 2026-04-27T11:22:35Z

In this PR, each best-first worker has its own local node heap, so it can push and pop nodes without synchronizing with other workers. Each best-first worker periodically steals a node from a random worker to keep the node distribution roughly balanced across them. Additionally, each best-first worker has a fixed set of diving workers assigned to it, which dive on its own nodes whenever possible. This essentially eliminates the need for the scheduler thread, freeing one additional thread for useful work.
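The local-heap and work-stealing idea described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `node_t`, `local_worker_t`, and `steal_from_random` are hypothetical names, and a `std::mutex` per heap stands in for whatever locking the real node queue uses. Each worker normally touches only its own heap, so the lock is uncontended except while another worker is stealing.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <random>
#include <vector>

// Hypothetical node type: only the lower bound matters for the ordering.
struct node_t {
  int id;
  double lower_bound;
};

// Comparator that turns std::priority_queue into a min-heap on the lower bound.
struct by_lower_bound {
  bool operator()(const node_t& a, const node_t& b) const { return a.lower_bound > b.lower_bound; }
};

// One best-first worker with its own heap (assumed design, for illustration).
class local_worker_t {
 public:
  void push(const node_t& n)
  {
    std::lock_guard<std::mutex> guard(mutex_);
    heap_.push(n);
  }

  // Pops the node with the smallest lower bound, if any.
  std::optional<node_t> pop_best()
  {
    std::lock_guard<std::mutex> guard(mutex_);
    if (heap_.empty()) { return std::nullopt; }
    node_t best = heap_.top();
    heap_.pop();
    return best;
  }

  std::size_t size()
  {
    std::lock_guard<std::mutex> guard(mutex_);
    return heap_.size();
  }

 private:
  std::mutex mutex_;
  std::priority_queue<node_t, std::vector<node_t>, by_lower_bound> heap_;
};

// Worker `self` steals the best node of a uniformly chosen other worker.
// Returns false if the victim had nothing to steal.
inline bool steal_from_random(std::vector<local_worker_t>& workers,
                              std::size_t self,
                              std::mt19937& rng)
{
  assert(workers.size() > 1 && self < workers.size());
  std::uniform_int_distribution<std::size_t> pick(0, workers.size() - 2);
  std::size_t victim = pick(rng);
  if (victim >= self) { ++victim; }  // skip our own index
  std::optional<node_t> stolen = workers[victim].pop_best();
  if (!stolen) { return false; }
  workers[self].push(*stolen);
  return true;
}
```

Between the pop from the victim and the push to the thief, the node lives in neither heap, which is why its lower bound has to be tracked separately during the transfer.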

This also implements a compression scheme for vstatus that uses only 2 bits per entry, which reduces its memory consumption by roughly 4x (previously each entry was an int8_t). Last but not least, this PR replaces std::deque with a fixed-capacity circular_deque_t for the plunge/dive stacks and the idle-worker list.
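As a rough illustration of the 2-bit packing, the sketch below stores four entries per byte. It assumes each status takes at most four distinct values (0 to 3); `packed_status_t` and its methods are hypothetical names, not the PR's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stores n status values of 2 bits each, four per byte (illustrative sketch).
class packed_status_t {
 public:
  explicit packed_status_t(std::size_t n) : size_(n), bytes_((n + 3) / 4, 0) {}

  std::uint8_t get(std::size_t i) const
  {
    assert(i < size_);
    const unsigned shift = static_cast<unsigned>(2 * (i % 4));
    return static_cast<std::uint8_t>((bytes_[i / 4] >> shift) & 0x3u);
  }

  void set(std::size_t i, std::uint8_t value)
  {
    assert(i < size_ && value < 4);
    const unsigned shift = static_cast<unsigned>(2 * (i % 4));
    const unsigned cleared = bytes_[i / 4] & ~(0x3u << shift);  // zero the old 2 bits
    bytes_[i / 4] = static_cast<std::uint8_t>(cleared | (static_cast<unsigned>(value) << shift));
  }

  std::size_t size() const { return size_; }
  std::size_t bytes() const { return bytes_.size(); }

 private:
  std::size_t size_;
  std::vector<std::uint8_t> bytes_;
};
```

With one byte per entry, 10 entries need 10 bytes; here they need `(10 + 3) / 4 = 3` bytes, which is where the roughly 4x saving comes from.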

MIPLIB results (GH200, 10min):

================================================================================
main (1, #1099) vs bnb-local-heap (2)
================================================================================

------------------------------------------------------------------------------------------------------------------------------
|                                        |       Run 1        |       Run 2        |     Abs. Diff.     |   Rel. Diff. (%)   |
------------------------------------------------------------------------------------------------------------------------------
| Feasible                                                 227                  228                   +1                 --- |
| Optimal                                                   75                   78                   +3                 --- |
| Solutions with <0.1% primal gap                          124                  130                   +6                 --- |
| Nodes explored (mean)                              4.866e+06            1.436e+07           +9.496e+06                +195 |
| Nodes explored (shifted geomean)                        6772            1.205e+04                +5275               +77.9 |
| Relative MIP gap (mean)                               0.3264               0.3415             +0.01506               +4.62 |
| Relative MIP gap (shifted geomean)                    0.1156               0.1131              -0.0025               -2.16 |
| Solve time (mean)                                      444.6                441.5               -3.054              -0.687 |
| Solve time (shifted geomean)                           221.5                219.1               -2.327               -1.05 |
| Primal gap (mean)                                      11.57                11.15              -0.4201               -3.63 |
| Primal gap (shifted geomean)                          0.6324               0.5604             -0.07203               -11.4 |
| Primal integral (mean)                                 32.63                33.02              +0.3805               +1.17 |
| Primal integral (shifted geomean)                      6.346                6.405             +0.05989              +0.944 |
------------------------------------------------------------------------------------------------------------------------------

In summary, we explored ~3x as many nodes on average in the same time frame. The number of optimal solutions also increased by 3.

Checklist

I am familiar with the Contributing Guidelines.
Testing
- New or existing tests cover these changes
- Added tests
- Created an issue to follow-up
- NA
Documentation
- The documentation is up to date with these changes
- Added new documentation
- NA

Remove dependency on rmm::mr::device_memory_resource base class. Resources now satisfy the cuda::mr::resource concept directly.
- Replace shared_ptr<device_memory_resource> with value types and cuda::mr::any_resource<cuda::mr::device_accessible> for type-erased storage
- Replace set_current_device_resource(ptr) with set_current_device_resource_ref
- Replace set_per_device_resource(id, ptr) with set_per_device_resource_ref
- Remove make_owning_wrapper usage
- Remove dynamic_cast on memory resources (no common base class)
- Remove owning_wrapper.hpp and device_memory_resource.hpp includes
- Add missing thrust/iterator/transform_output_iterator.h include (no longer transitively included via CCCL)

Kh4ster

Very cool results! Thanks @nguidotti !

nguidotti · 2026-05-24T11:14:07Z

/ok to test 207fab3

nguidotti · 2026-05-26T14:50:02Z

/ok to test cddb787

chris-maes · 2026-05-27T16:32:01Z

          break;
        }
      }
-      lower_bound = node_queue_.best_first_queue_size() > 0 ? node_queue_.get_lower_bound()


Is this correct? Above you are pushing nodes into the worker queue.
But you removed this line that gets the lower bound from the queue if there are nodes in the queue.

Maybe this is happening in this line:

lower_bound = std::min(lower_bound, worker->node_queue.get_lower_bound());

But it's difficult to follow. Maybe split it into two loops. The first loop goes through the nodes in the queue and fathoms them.

The second loop goes over the workers to compute the lower bound.
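One possible shape for the two-loop split suggested above is sketched here. It is only an illustration under simplifying assumptions: each worker's queue is a plain `std::vector`, and `node_t`, `worker_t`, `fathom_nodes`, and `compute_lower_bound` are hypothetical names rather than the functions in `branch_and_bound.cpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct node_t {
  double lower_bound;
};

struct worker_t {
  std::vector<node_t> nodes;  // stand-in for the worker's node queue
};

// Loop 1: drop every node that cannot improve on the incumbent (upper bound).
inline std::size_t fathom_nodes(std::vector<worker_t>& workers, double upper_bound)
{
  assert(!std::isnan(upper_bound));
  std::size_t removed = 0;
  for (auto& w : workers) {
    auto first_fathomed =
      std::remove_if(w.nodes.begin(), w.nodes.end(), [upper_bound](const node_t& n) {
        return n.lower_bound >= upper_bound;
      });
    removed += static_cast<std::size_t>(w.nodes.end() - first_fathomed);
    w.nodes.erase(first_fathomed, w.nodes.end());
  }
  return removed;
}

// Loop 2: the lower bound is the minimum over all remaining nodes of all workers.
inline double compute_lower_bound(const std::vector<worker_t>& workers)
{
  double lower_bound = std::numeric_limits<double>::infinity();
  for (const auto& w : workers) {
    for (const auto& n : w.nodes) {
      lower_bound = std::min(lower_bound, n.lower_bound);
    }
  }
  return lower_bound;
}
```

Keeping the two passes separate makes it easier to check that no node's bound is dropped between fathoming and the bound computation.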

chris-maes · 2026-05-27T16:35:23Z

+
+      // We need to temporarily save the lower bound in this worker so it is
+      // considered when calculating the global lower bound.
+      this->lower_bound = std::min<f_t>(this->lower_bound, other->node_queue.get_lower_bound());


I'm confused as to what is happening here. Again, the comment says we are temporarily saving the lower bound of the worker, but then we don't reset the lower bound.

chris-maes

Thanks @nguidotti. I took a brief pass at this PR.

My biggest comment is that after this PR, I find it very hard to reason about the lower bound. We've had many bugs where the lower bound is incorrect or lost. I'm concerned that unless we make the handling of the lower bound as clear as possible to follow, there will be more bugs. There are so many places in the code that now track the lower bound. For instance, there is a lower bound member in the node_queue_t. There is a lower bound member in bfs_worker_t (really in branch_and_bound_worker_t), there is a local variable named lower_bound in best_first_search_with, there is a function called get_lower_bound, and there is the lower bound ceiling.

Is there any way to simplify this? For instance, do we need to have the lower bound stored in both the node_queue_t and the worker? Could we store it in only one place?

Can we clarify the comments around temporarily setting the lower bound?

nguidotti · 2026-05-27T19:33:07Z

> Thanks @nguidotti. I took a brief pass at this PR.
>
> My biggest comment is that after this PR, I find it very hard to reason about the lower bound. We've had many bugs where the lower bound is incorrect or lost. I'm concerned that unless we make the handling of the lower bound as clear as possible to follow, there will be more bugs. There are so many places in the code that now track the lower bound. For instance, there is a lower bound member in the node_queue_t. There is a lower bound member in bfs_worker_t (really in branch_and_bound_worker_t), there is a local variable named lower_bound in best_first_search_with, there is a function called get_lower_bound, and there is the lower bound ceiling.
>
> Is there any way to simplify this? For instance, do we need to have the lower bound stored in both the node_queue_t and the worker? Could we store it in only one place?
>
> Can we clarify the comments around temporarily setting the lower bound?

Unfortunately, no.

Let's forget the parallel execution for a moment. When we are doing best-first search with plunges, there are two places where we need to consider the lower bound: the top of the heap (worker->node_queue.get_lower_bound()) and the node that is currently being solved (worker->lower_bound). At the start of the plunge, we pop the best node from the heap, so the heap no longer contains the lowest lower bound. As the plunge proceeds, the lower bound of the node currently being solved changes and can become worse than the new top of the heap. To avoid locking the queue just to read the lower bound, I simply use an atomic to track it, which is read when calling worker->node_queue.get_lower_bound().

nguidotti · 2026-05-27T19:36:14Z

To compute the lower bound for the entire solver, you take
$$L_{global} = \min_k L^{(k)}, \qquad L^{(k)} = \min\{L_{heap}^{(k)}, L_{cur}^{(k)}\}$$
where $L_{heap}^{(k)}$ is the lower bound at the top of the heap and $L_{cur}^{(k)}$ is the lower bound of the node currently being solved. The superscript $(k)$ denotes the worker $k$.
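The formula above could be computed along these lines. This is a hedged sketch only: `worker_bounds_t` and `global_lower_bound` are hypothetical names, and `std::atomic<double>` stands in for the atomics used in the PR so the bounds can be read without locking the queues.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <limits>
#include <vector>

constexpr double kInf = std::numeric_limits<double>::infinity();

// Per-worker bounds; +inf means "nothing in the heap" or "no node being solved".
struct worker_bounds_t {
  std::atomic<double> heap_bound{kInf};     // L_heap^(k): bound at the top of the heap
  std::atomic<double> current_bound{kInf};  // L_cur^(k): bound of the node being solved

  // L^(k) = min{L_heap^(k), L_cur^(k)}
  double local_bound() const { return std::min(heap_bound.load(), current_bound.load()); }
};

// L_global = min over all workers k of L^(k)
inline double global_lower_bound(const std::vector<worker_bounds_t>& workers)
{
  assert(!workers.empty());
  double lower_bound = kInf;
  for (const auto& w : workers) {
    lower_bound = std::min(lower_bound, w.local_bound());
  }
  return lower_bound;
}
```

A worker that is in the middle of a plunge keeps `current_bound` finite, so the global bound cannot jump past a node that has left the heap but is not yet resolved.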

nguidotti · 2026-05-27T19:40:11Z

There are other things we also need to consider. For instance, we might encounter a numerical issue in a given node, which prevents the solver from progressing further down that path. Similarly, when starting a new worker or stealing a node, the node needs to be transferred from one worker to the next (pop the best node from one heap, then push it onto the other). In the meantime, the lower bound of that node needs to be stored somewhere so it can be considered when computing the global lower bound.

nguidotti · 2026-05-27T19:41:35Z

I tried to concentrate all of the global lower bound computation in get_lower_bound in B&B, but some of the values are set during the exploration.

nguidotti · 2026-05-27T19:42:06Z

/ok to test 02d0381

nguidotti · 2026-05-27T21:18:09Z

/ok to test e212f63

aliceb-nv

Excellent work Nicolas :) Mostly some nitpicks.
I've run your PR on Eos and I see material improvements:

================================================================================
            THEORETICAL BEST RESULTS (Combining Best from Both Runs)
================================================================================
Baseline Run:
  Feasible count: 226
  Gap <= 0.1%: 116
  Average primal gap: 0.135526
  Average dual gap: 0.223231
  Optimal (MIP gap < 1e-4): 60

Compared Run:
  Feasible count: 229
  Gap <= 0.1%: 115
  Average primal gap: 0.121167
  Average dual gap: 0.213514
  Optimal (MIP gap < 1e-4): 66

Theoretical Best (if all regressions fixed):
  Feasible count: 229
  Gap <= 0.1%: 124
  Average gap: 0.111041

Potential Improvement:
  Feasible count improvement: 0
  Average gap improvement: 0.024485
================================================================================

================================================================================
INSTANCES SOLVED TO OPTIMALITY BY COMPARED RUN BUT NOT BY BASELINE (9 instances)
================================================================================
Instance                                   Baseline MIP Gap   Compared MIP Gap
--------------------------------------------------------------------------------
ns1952667                                         100.0000%            0.0000%
peg-solitaire-a3                                  100.0000%            0.0000%
neos-5188808-nattai                                89.0663%            0.0000%
neos-950242                                        25.0000%            0.0000%
rail507                                             0.5747%            0.0000%
neos-4722843-widden                                 0.3150%            0.0000%
neos-4738912-atrato                                 0.0100%            0.0099%
gen-ip002                                           0.0100%            0.0099%
binkar10_1                                          0.0100%            0.0099%

================================================================================
INSTANCES SOLVED TO OPTIMALITY BY BASELINE BUT NOT BY COMPARED RUN (3 instances)
================================================================================
Instance                                   Baseline MIP Gap   Compared MIP Gap
--------------------------------------------------------------------------------
ns1208400                                           0.0000%          100.0000%
triptim1                                            0.0000%            1.3885%
neos-933966                                         0.0000%            0.9346%

aliceb-nv · 2026-05-28T09:16:58Z

+    case LINE_SEARCH_DIVING: return settings.line_search_diving != 0;
+    case GUIDED_DIVING: return settings.guided_diving != 0 && has_incumbent;
+    case COEFFICIENT_DIVING: return settings.coefficient_diving != 0;
+    default: return false;


We shouldn't add a default clause to most switch statements IMO - omitting it allows the compiler to -Werror at us whenever we add a new enum value and forget to update some of these switches :)
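A small sketch of the pattern being suggested, using made-up names (`diving_kind_t`, `diving_settings_t`, `is_enabled`) rather than the enum from the PR. With GCC or Clang, `-Wswitch` (enabled by `-Wall`) warns when a switch over an enum has no `default` and misses an enumerator, and `-Werror` turns that warning into a build failure.

```cpp
#include <cassert>

// Hypothetical enum and settings, for illustration only.
enum class diving_kind_t { line_search, guided, coefficient };

struct diving_settings_t {
  int line_search_diving = 1;
  int guided_diving      = 1;
  int coefficient_diving = 1;
};

inline bool is_enabled(diving_kind_t kind, const diving_settings_t& settings, bool has_incumbent)
{
  // No default clause: adding a new enumerator without a case here triggers -Wswitch.
  switch (kind) {
    case diving_kind_t::line_search: return settings.line_search_diving != 0;
    case diving_kind_t::guided: return settings.guided_diving != 0 && has_incumbent;
    case diving_kind_t::coefficient: return settings.coefficient_diving != 0;
  }
  // Only reachable if `kind` holds a value outside the enumerators.
  assert(false && "unhandled diving_kind_t value");
  return false;
}
```

The fallback after the switch keeps the function well-defined for out-of-range values without hiding missing cases from the compiler.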

aliceb-nv · 2026-05-28T09:18:31Z

+  void set_inactive() { this->is_active = false; }
+
+  // Steal nodes from another worker
+  bool steal_node_from(bfs_worker_t* other, i_t num_nodes)


Let's use a reference for 'other' since it doesn't semantically make sense for it to be nullptr

aliceb-nv · 2026-05-28T09:29:35Z

+      return steal;
+    }
+
+    while (num_nodes > 0) {


As a side note - do we (or the OMP runtime) have a way to measure and report the amount of contention on omp locks?

aliceb-nv · 2026-05-28T09:35:03Z


 private:
  std::vector<T> buffer;
+  omp_atomic_t<size_t> num_entries_{0};


Why was this added as an atomic?

aliceb-nv · 2026-05-28T09:37:37Z

+  void push(mip_node_t<i_t, f_t>* new_node)
+  {
+    assert(new_node != nullptr);
+    auto entry = std::make_shared<heap_entry_t>(new_node);


What are the ownership rules regarding new_node? Is a std::shared_ptr required here instead of unique_ptr + ownership moves?
I realize this is existing code, I'm just struggling to remember why we made these choices :)

aliceb-nv · 2026-05-28T09:43:06Z

 /* clang-format on */

 #pragma once
+#include <array>


nit: maybe we could stick to a C array here to avoid having to pull in C++ machinery for such a very common header, to keep build times slim

aliceb-nv · 2026-05-28T09:46:31Z

+          worker->node_queue.lock();
+          worker->node_queue.push(node);
+          worker->node_queue.unlock();


Maybe we could turn this into a push_atomic() primitive. That'd lessen the risk of accidentally adding a racy push later down the line. If we want to keep the ability to do unlocked pushes, maybe we could make it explicit in the naming, e.g. push_nolock()?
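The naming scheme proposed here might look like the sketch below. It is an assumption-laden illustration: `locked_queue_t` is a hypothetical container, and `std::mutex` replaces the OMP lock used by the real node queue. Only the `push_atomic` and `push_nolock` names come from the comment.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative queue with an explicit locked and unlocked push.
template <typename T>
class locked_queue_t {
 public:
  // Safe default: takes the lock for the duration of the push.
  void push_atomic(const T& item)
  {
    std::lock_guard<std::mutex> guard(mutex_);
    push_nolock(item);
  }

  // The caller must already hold mutex(); the name makes that requirement visible.
  void push_nolock(const T& item) { items_.push_back(item); }

  std::size_t size()
  {
    std::lock_guard<std::mutex> guard(mutex_);
    return items_.size();
  }

  // Exposed so callers can batch several push_nolock calls under one lock.
  std::mutex& mutex() { return mutex_; }

 private:
  std::mutex mutex_;
  std::vector<T> items_;
};
```

A caller that needs several pushes can lock once via `mutex()` and call `push_nolock` repeatedly, while everyone else uses `push_atomic` and cannot forget the lock.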

aliceb-nv · 2026-05-28T09:55:55Z

+  worker->set_inactive();
+  bfs_worker_pool_.return_worker_to_pool(worker);


Does "return_worker_to_pool" always imply "worker" is set inactive? If so, it may be safer to just have "return_worker_to_pool" unconditionally mark the worker as inactive
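If returning a worker does always imply deactivating it, the two steps could be folded together roughly as below. This is a single-threaded sketch with hypothetical types (`worker_t`, `worker_pool_t`, `get_idle_worker`); the real pool in the PR is concurrent and is not reproduced here. It also takes the worker by reference, in the spirit of the earlier comment about `other` never being null.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct worker_t {
  bool is_active = false;
};

// Single-threaded illustration; not safe for concurrent use as written.
class worker_pool_t {
 public:
  explicit worker_pool_t(std::size_t n) : workers_(n)
  {
    for (auto& w : workers_) { idle_.push_back(&w); }
  }
  worker_pool_t(const worker_pool_t&)            = delete;  // idle_ holds pointers into workers_
  worker_pool_t& operator=(const worker_pool_t&) = delete;

  // Hands out an idle worker and marks it active, or returns nullptr if none is idle.
  worker_t* get_idle_worker()
  {
    if (idle_.empty()) { return nullptr; }
    worker_t* w = idle_.back();
    idle_.pop_back();
    assert(!w->is_active);
    w->is_active = true;
    return w;
  }

  // Returning a worker always marks it inactive, so callers cannot forget to.
  void return_worker_to_pool(worker_t& w)
  {
    w.is_active = false;
    idle_.push_back(&w);
  }

  std::size_t num_idle() const { return idle_.size(); }

 private:
  std::vector<worker_t> workers_;
  std::vector<worker_t*> idle_;
};
```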

bdice and others added 30 commits April 3, 2026 13:51

split worker and worker pool in separated file. code cleanup.

e77dbc2

simplified logic for pseudo cost (and its snapshot) for the regular a…

62d0452

…nd deterministic mode. Signed-off-by: Nicolas Guidotti <224634272+nguidotti@users.noreply.github.com>

fixed compilation

a517f13

Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>

added missing header

f31599c

Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>

fixed guard against no incumbent when calling guided diving

202738f

Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>

addressing code rabbit comments. replaced AT in pseudo_costs_t with a…

4aed76c

… shared_ptr to avoid unnecessary copy. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

missing dereference

a5c111d

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'main' into simplify-pseudocost

919e445

split best-first and diving worker into separated objects

76ce1bb

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

increase the wheel size limit

c433e41

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fixed rng offset

52db538

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

increasing wheel size limit for CUDA 12

3676432

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

first version of the B&B workers with local heaps

d2f6eb7

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

implemented a lock-free stack to track the idle workers. fix potentia…

6a39187

…l crash in work-stealing Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fixed lower bound calculation at end of the B&B. reverted to locking …

dec671c

…queue for now. refactoring. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

correctly handles the node in the stack when the solver stops if they…

1b3a282

… are present Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

added atomic in node queue to track size and lower bound without a lock.

e108a54

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

replaced std::deque with a circular buffer.

315aca6

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge remote-tracking branch 'upstream/main' into rmm-cccl-migration

536a692

# Conflicts: # cpp/src/utilities/cuda_helpers.cuh

Inline upstream memory resource variable in test fixture MR composition

31a6eab

Replace deprecated rmm::mr set_*_resource_ref calls with set_*_resource

f889d28

renamed method

3469026

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'main' into simplify-pseudocost

8e8c794

# Conflicts: # ci/validate_wheel.sh

Merge branch 'main' into simplify-pseudocost

3e6aa83

merging with main branch

e0444c2

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fixed compilation

f3e863f

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge remote-tracking branch 'upstream/main' into rmm-cccl-migration

76c9ece

fixed small bugs

56bf9ed

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

added cleanup routine for the diving heap

18e1e83

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'release/26.06' into bnb-local-heap

cbaae26

# Conflicts: # cpp/src/branch_and_bound/mip_node.hpp # cpp/src/branch_and_bound/pseudo_costs.cpp

Kh4ster approved these changes May 21, 2026


nguidotti added 3 commits May 24, 2026 12:36

fixed missing time limit status

bd4c631

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'release/26.06' into bnb-local-heap

8f587d9

merge with release/26.06 branch

207fab3

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

nguidotti added 3 commits May 26, 2026 14:11

investigating missing timeout log line

bd1e729

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

removed debug prints

e7d3628

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'release/26.06' into bnb-local-heap

cddb787

# Conflicts: # cpp/src/mip_heuristics/mip_constants.hpp

nguidotti mentioned this pull request May 26, 2026

Expose diving hyper parameters + Vector length/Farkas diving #1298


chris-maes reviewed May 27, 2026

Comment thread cpp/src/dual_simplex/initial_basis.cpp Outdated

Comment thread cpp/src/dual_simplex/initial_basis.cpp Outdated

Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp Outdated

Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp

Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp Outdated

chris-maes requested changes May 27, 2026

nguidotti added 3 commits May 27, 2026 21:08

move most of the logic for the lower bound computation to `get_lower_…

0ff738f

…bound` Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

simplified logic on the heap cleanup

20e874a

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

simplified expression as the lower bound from the queue is already co…

02d0381

…nsidered in `get_lower_bound` Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fix order for updating the lower bound. fix compilation

e212f63

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

aliceb-nv reviewed May 28, 2026
